ABSTRACT


Dialect Identification is an important topic of research in Natural Language Pro- 
cessing (NLP) as it has broad implications in many real-world applications such as 
machine translation, speech recognition and chatbots to name a few. In this work, 
we investigate Tigrinya dialect identification using machine learning techniques. 
To that end, we have identified three Tigrinya dialects, namely: Z, L and D. Then 
we systematically collected datasets for each dialect. Finally, we performed exper-
iments using classical machine learning and deep learning methods to quantify the
effectiveness of current methods on the problem of Tigrinya dialect identification.
The highest overall accuracy of 92.98% was achieved using character-level Con- 
volutional Neural Networks (CNNs). 


1 INTRODUCTION


Tigrinya is the official regional language of Tigray, in the northern part of Ethiopia. It is also among 
the official national languages of Eritrea (Gebregzabiher, 2010). In this work, we define the dialect
identification problem as a multi-class classification task. Given a chunk of Tigrinya text, we train
models that determine which Tigrinya dialect the text is written in. Our main contributions are: (i)
a unified categorization of Tigrinya dialects and (ii) an annotated dataset for the study of Tigrinya dialects.


2 RELATED WORK 


Tigrinya Dialects: Previous works have categorized Tigrinya dialects, mainly using geography of
speakers and characteristics of the different dialects. Weldezgu Mehari (2021) listed three Tigrinya
dialects, namely northern, central and southern dialects, whereas only two dialects (Tigray and Er-
itrea) are mentioned as common dialects in Feleke (2017). compared Wajerat
Tigrinya with mainstream Tigrinya, and stated that Tigrinya has multiple varieties that broadly dif- 
fer from one another, not only from one zone to the other but within one zone as well. We have 
synthesized three dialects: Z, L, D based on these previous works that had concrete data (examples 
and dialect characteristics) to support each dialect. 

Dialect Identification: studied four dialects of Amharic. They ap- 
plied text independent back-propagation ANN, VQ (vector quantization), GMM and a composite 
of GMM and back-propagation ANN to categorize Amharic speakers into four dialects. They used 
MFCC (Mel Frequency Cepstral Coefficients) as the feature extractor; the blend of GMM and
back-propagation ANN scored 95.7% accuracy. employed Levenshtein Algorithm to
compute lexical distance and agglomerative clustering method to categorize Afaan Oromo linguistic 
varieties into six dialect clusters. 

Language Identification: conducted a comparative study of automatic language 
identification on Ethio-Semitic languages — Amharic, Ge'ez, Gurage and Tigrinya using character 
n-grams. The author compared Naive Bayes with Cumulative Frequency Addition and concluded 
that the latter performed better. examined the similarity and the mutual intelligibility 
between Amharic and Tigrinya using Levenshtein distance, intelligibility test and questionnaires. 
He found both Tigrinya varieties have almost equal phonetic and lexical distances from Amharic. 

3 DATASET


There are no available datasets to study Tigrinya dialects. In this section, we describe how we 
generated a dataset for the purpose of identifying Tigrinya dialects from text. Our approach is as 
follows: (i) we took sample sources, e.g., books, from each dialect when available; (ii) for a
sample text from each source, we had native speakers of the target dialects translate it into the
other dialects; and (iii) finally, we performed quality checks.


Concretely, our implementation is as follows: For the Z variety, we used snippets from the book Kilte
Zantatat 2021). For the L variety, we used book chapters from Fato 
and Erqi Enderta (Solomon, 2020). These were then translated to Z and D varieties. For the 
D variant, we could not find a book. Instead, we collected data from two Facebook users, Akeza
Awalom and Guraya Asadi Raya (2021), who consistently write in that
variety. The collected posts were translated to Z and L varieties in a similar fashion. Our approach
makes it possible to study the dialect identification on a corpus with similar content across dialects, 
avoiding effects such as author or category of text. 


This way we gathered a total of 2,964 source sentences from all three dialects. The source sentence 
breakdown per dialect is: Z (n=1125), L (n=1075) and D (n=764). These were then translated to the
other two dialects, resulting in a total of 8,892 sentences. In total, 14 native speakers of the
target dialects participated in the translation task. We make the dataset publicly available¹ for other
researchers to build upon.
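
The corpus size follows directly from the per-dialect source counts. The short sketch below reproduces that arithmetic; the function names are illustrative, and the per-dialect total is an inference from the description above (every source sentence has one version in each dialect), not a figure stated in the text.

```python
# Source sentence counts per dialect, as reported in the text.
source_counts = {"Z": 1125, "L": 1075, "D": 764}


def corpus_size(counts):
    """Return (total source sentences, total sentences after translation)."""
    n_dialects = len(counts)
    total_sources = sum(counts.values())
    # Each source sentence ends up with one version per dialect:
    # the original plus one translation into each of the other dialects.
    return total_sources, total_sources * n_dialects


def per_dialect_size(counts):
    """Assumption: every dialect holds one version of every source sentence."""
    total_sources = sum(counts.values())
    return {dialect: total_sources for dialect in counts}


total_sources, total_sentences = corpus_size(source_counts)
print(total_sources, total_sentences)  # prints: 2964 8892
```

Under this reading, the three classes are balanced, with 2,964 sentences each.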


4 EXPERIMENTS AND RESULTS 


We cast the problem of identifying dialects as a multi-class classification task. As baselines, we 
used three classical text classification approaches, namely Naive Bayes, Linear SVM and ANN. We
used character n-grams of length one to five (n = 1-5) as features. In addition to the experiments
conducted on the classical text classifiers, we also conducted similar experiments with deep-learning
approaches, namely CNN, BiLSTM and a combination of the two models (CNN-BiLSTM), using
character sequences. We used the NLTK² and Scikit-Learn³ libraries for the classical machine
learning algorithms. To train the deep learning models, we used Keras⁴ with TensorFlow as the
backend. Table 1 shows the results of our experiments. Using ten-fold cross-validation, we showed
that the CNN-based model achieved 92.98% overall accuracy.
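
To make the character n-gram baseline concrete, the following is a minimal, standard-library-only sketch of n-gram extraction (n = 1 to 5) feeding a multinomial Naive Bayes classifier with add-one smoothing. It is an illustration under stated assumptions, not the implementation used in the experiments (which relied on NLTK and Scikit-Learn): the class name `CharNgramNB`, the smoothing choice, and the toy strings are all hypothetical placeholders.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n_min=1, n_max=5):
    """Return all character n-grams of length n_min..n_max found in text."""
    grams = []
    for n in range(n_min, n_max + 1):
        grams.extend(text[i:i + n] for i in range(len(text) - n + 1))
    return grams


class CharNgramNB:
    """Multinomial Naive Bayes over character n-gram counts (add-one smoothing)."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.gram_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.gram_counts[label].update(char_ngrams(text))
        self.vocab = set()
        for counts in self.gram_counts.values():
            self.vocab.update(counts)
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.gram_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)  # log prior
            for gram in char_ngrams(text):
                if gram in self.vocab:  # skip n-grams never seen in training
                    score += math.log((counts[gram] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy placeholder strings, NOT taken from the Tigrinya dataset.
model = CharNgramNB().fit(["aaaa", "bbbb"], ["Z", "L"])
print(model.predict("aa"), model.predict("bbb"))
```

Because the features are raw characters, the same extraction works unchanged on Ge'ez-script text; no tokenizer or word segmentation is assumed.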


Table 1: Weighted average Precision, Recall and F1-measure as well as overall Accuracy of dialect
identification for Tigrinya


              Precision  Recall  F1-measure  Accuracy
NB            0.85       0.83    0.82        0.83
SVM           0.87       0.87    0.87        0.87
BiLSTM        0.88       0.88    0.88        0.88
ANN           0.89       0.89    0.89        0.89
CNN-BiLSTM    0.91       0.91    0.91        0.91
CNN           0.92       0.92    0.92        0.92
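
For readers who want to reproduce the kind of metrics reported in Table 1, the sketch below computes weighted-average precision, recall and F1 together with overall accuracy. It assumes the usual support-weighted definition (each class's score weighted by its share of the true labels); the text does not spell out the exact formula, and the function name and toy labels are illustrative only.

```python
from collections import Counter


def weighted_scores(y_true, y_pred):
    """Return support-weighted precision, recall, F1, and overall accuracy."""
    labels = sorted(set(y_true))
    support = Counter(y_true)
    n = len(y_true)
    precision = recall = f1 = 0.0
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        predicted = sum(p == label for p in y_pred)
        p_c = tp / predicted if predicted else 0.0
        r_c = tp / support[label]
        f_c = 2 * p_c * r_c / (p_c + r_c) if (p_c + r_c) else 0.0
        weight = support[label] / n  # class share of the true labels
        precision += weight * p_c
        recall += weight * r_c
        f1 += weight * f_c
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / n
    return precision, recall, f1, accuracy


# Toy labels for illustration only; not experimental output.
y_true = ["Z", "Z", "L", "L", "D", "D"]
y_pred = ["Z", "L", "L", "L", "D", "Z"]
precision, recall, f1, accuracy = weighted_scores(y_true, y_pred)
print(round(precision, 4), round(recall, 4), round(f1, 4), round(accuracy, 4))
```

Under this definition, weighted recall is algebraically equal to overall accuracy, which is consistent with the identical Recall and Accuracy columns in Table 1.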


In this work, we have demonstrated the application of machine learning techniques for dialect iden- 
tification in Tigrinya text. Our findings contribute to the ongoing efforts to standardize the language 
by incorporating all dialects. Furthermore, we believe that this research plays a crucial role in doc- 
umenting and preserving the language. In future work, we plan to expand the dataset and explore 
additional modalities, such as audio. We will also study the effect of dialect identification on a real-
world machine translation application for Tigrinya to and from Amharic and English.


¹ https://zenodo.org/record/78323294.ZDtFk9JBx3Q
² https://www.nltk.org/
⁴ https://keras.io/

Figure 1: A random sample of text examples for different Tigrinya dialects (columns: Source, Z, L, D).


